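Modern high-performance computing faces a fundamental "memory wall": computational throughput (floating-point operations per second, FLOPS) has grown explosively, far outpacing the modest gains in global memory bandwidth. This gap turns large many-core arrays into "starved" processors that sit idle while they wait for data to arrive.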
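1. The Bandwidth Gap
Although a GPU can execute trillions of operations per second, the physical path to memory (DRAM) is constrained by pin density and power requirements. Memory is the primary limit on parallelism: as the number of concurrent threads grows, the bandwidth available to each thread shrinks, and the hardware stalls instead of doing useful work.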
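The gap can be put in numbers with the Roofline bound, where attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity. The following is a minimal sketch; the device figures (10 TFLOP/s, 500 GB/s) and the function name `attainable_flops` are illustrative assumptions, not measurements of any real GPU.

```python
def attainable_flops(peak_flops: float, bandwidth_bytes: float, intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_flops, bandwidth_bytes * intensity)


# Hypothetical device figures, chosen only for illustration.
PEAK_FLOPS = 10e12   # 10 TFLOP/s of compute
BANDWIDTH = 500e9    # 500 GB/s of global memory bandwidth

# Ridge point: the intensity (FLOP per byte) above which a kernel is no longer memory bound.
ridge = PEAK_FLOPS / BANDWIDTH

for intensity in (0.25, 1.0, 4.0, 20.0, 64.0):
    perf = attainable_flops(PEAK_FLOPS, BANDWIDTH, intensity)
    regime = "memory bound" if intensity < ridge else "compute bound"
    print(f"{intensity:6.2f} FLOP/byte -> {perf / 1e12:5.2f} TFLOP/s ({regime})")
```

Under these assumed figures, a kernel must perform about 20 floating-point operations per byte fetched before the cores, rather than the memory bus, become the limit.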
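2. The Kitchen Analogy
Imagine a state-of-the-art kitchen (the GPU cores) that can produce 1,000 meals per hour. The ingredients, however, are stored in a warehouse five miles away (global memory), and there is only one delivery scooter (the memory bus). No matter how many chefs you hire, output is capped by the speed of the scooter.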
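3. Architectural Contrast
A typical multi-core CPU system uses large caches to hide latency for a small number of heavyweight threads. Massively parallel architectures, by contrast, face constant "traffic congestion" from huge numbers of concurrent requests. Resource limits at the register and shared-memory level determine the maximum parallelism (occupancy) the hardware can sustain before it is oversubscribed.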
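To see how register and shared-memory budgets cap parallelism, the sketch below computes how many threads can be resident on one SM and which resource runs out first. The SM limits used (65,536 registers, 48 KiB of shared memory, 2,048 threads) and the helper name `resident_threads` are assumptions for illustration; the model also ignores allocation granularity and any per-SM block-count limit that real hardware imposes.

```python
def resident_threads(regs_per_thread, smem_per_block, threads_per_block,
                     sm_registers, sm_shared_bytes, sm_max_threads):
    """Return (threads resident on one SM, name of the limiting resource)."""
    # Number of whole blocks each resource can accommodate.
    limits = {
        "thread limit": sm_max_threads // threads_per_block,
        "registers": sm_registers // (regs_per_thread * threads_per_block),
    }
    if smem_per_block > 0:
        limits["shared memory"] = sm_shared_bytes // smem_per_block
    limiter = min(limits, key=limits.get)
    return limits[limiter] * threads_per_block, limiter


# Hypothetical SM limits, not taken from any specific GPU datasheet.
SM = dict(sm_registers=65536, sm_shared_bytes=48 * 1024, sm_max_threads=2048)

for regs, smem in ((32, 8192), (64, 4096)):
    threads, limiter = resident_threads(regs, smem, 256, **SM)
    print(f"{regs} regs/thread, {smem} B smem/block: "
          f"{threads} threads ({threads / SM['sm_max_threads']:.0%}), limited by {limiter}")
```

In the first configuration shared memory is the binding limit; in the second, doubling the registers per thread makes the register file the limit instead.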
QUESTION 1
What is the primary cause of the 'Memory Wall' in modern GPU computing?
The clock speed of cores is too slow to process DRAM data.
Computational throughput (FLOPS) has increased much faster than memory bandwidth.
Shared memory is too large for the hardware to manage.
Global memory has higher latency than CPU registers.
✅ Correct!
This gap creates a bottleneck where processors spend most of their time waiting for data delivery.
❌ Incorrect
The issue is the growth rate discrepancy between processing speed and the physical bandwidth of the memory bus.
QUESTION 2
In the 'Kitchen Analogy,' what does the delivery scooter represent?
The GPU Core/Chef.
The Register File.
The Global Memory Bus.
The Operating System Scheduler.
✅ Correct!
The bus (scooter) is the narrow pipe that limits how fast 'ingredients' (data) reach the 'chefs' (cores).
❌ Incorrect
The scooter represents the transmission medium, not the compute resource or the storage itself.
QUESTION 3
How do resource limitations like register count affect parallelism?
They increase the speed of each individual thread.
They limit occupancy by reducing the number of active threads that can reside on an SM.
They have no effect on throughput, only on power consumption.
They bypass the need for global memory access.
✅ Correct!
Since hardware has a fixed pool of registers, using more registers per thread forces the GPU to run fewer concurrent threads.
❌ Incorrect
Exceeding per-thread resource limits directly lowers 'occupancy,' meaning fewer threads are available to hide memory latency.
QUESTION 4
When a kernel is in the 'Memory Bound' region of the Roofline Model, what is the best way to improve performance?
Increase the number of floating-point operations per second.
Increase the arithmetic intensity (data reuse).
Decrease the number of threads per block.
Add more complex branching logic.
✅ Correct!
Increasing arithmetic intensity (reusing data from shared memory) moves the kernel closer to the compute-bound plateau.
❌ Incorrect
In a memory-bound state, adding more math won't help if the bottleneck is fetching data from memory.
QUESTION 5
Why is implicit synchronization unreliable in massively parallel architectures?
Hardware evolution means threads within a warp are no longer guaranteed to execute in SIMT lockstep.
Shared memory is too fast for synchronization to matter.
Global memory access is always synchronous.
Threads are processed sequentially in blocks.
✅ Correct!
Always use `__syncthreads()` to ensure data consistency, as hardware execution order is not guaranteed.
❌ Incorrect
Relying on warp-level timing is dangerous; explicit barriers are mandatory for correctness in shared memory access.
Case Study: Memory Optimization Audit
Analyzing Matrix Operations
You are auditing two kernels: Kernel A performs simple Matrix Addition ($C = A + B$). Kernel B performs Matrix Multiplication ($C = A \times B$). You apply Shared Memory Tiling to both.
1. Which kernel will see a significant reduction in global memory bandwidth consumption after tiling?
Solution:
Kernel B (Matrix Multiplication). In matrix multiplication, each input element is used multiple times by different threads, so staging it in a shared-memory tile lets those threads reuse it instead of re-reading global memory. In matrix addition, by contrast, each element is accessed exactly once by one thread, so tiling offers no reuse benefit.
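The reuse argument can be checked by counting global-memory loads. The sketch below is a sequential CPU model, not GPU code: `matmul_naive` and `matmul_tiled` are hypothetical helper names, the "shared memory" tiles are ordinary Python lists, and the matrix size (8) and tile width (4) are arbitrary choices for illustration.

```python
def matmul_naive(A, B, n):
    """Every multiply-add reads both operands straight from 'global memory'."""
    loads = 0
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                loads += 2
    return C, loads


def matmul_tiled(A, B, n, t):
    """Stage t x t tiles once per block, then reuse them for t*t*t multiply-adds."""
    assert n % t == 0, "this sketch assumes the tile width divides n"
    loads = 0
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, t):
        for bj in range(0, n, t):
            for bk in range(0, n, t):
                # One global load per tile element (the 'shared memory' staging step).
                tile_a = [[A[bi + i][bk + k] for k in range(t)] for i in range(t)]
                tile_b = [[B[bk + k][bj + j] for j in range(t)] for k in range(t)]
                loads += 2 * t * t
                for i in range(t):
                    for j in range(t):
                        for k in range(t):
                            C[bi + i][bj + j] += tile_a[i][k] * tile_b[k][j]
    return C, loads


n, t = 8, 4
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[(i + j) % 3 for j in range(n)] for i in range(n)]

C_naive, loads_naive = matmul_naive(A, B, n)
C_tiled, loads_tiled = matmul_tiled(A, B, n, t)
add_loads = 2 * n * n  # matrix addition reads each input element once, tiled or not

print(loads_naive, loads_tiled, add_loads)
```

In this model the tiled multiplication issues a factor of `t` fewer global loads than the naive version, while the load count for addition has no reuse to exploit.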
2. If an SM has 8,192 registers and a thread limit of 768, what is the maximum number of registers each thread can use while maintaining 100% occupancy?
Solution:
$\lfloor 8{,}192 / 768 \rfloor = 10$ registers per thread (the exact quotient is about 10.67, rounded down). If a kernel uses 11 registers per thread, 768 threads would need $11 \times 768 = 8{,}448$ registers, more than the SM provides, so occupancy drops because the SM cannot fit all 768 threads simultaneously.
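How far occupancy drops depends on block size, because threads are scheduled onto an SM in whole blocks. The sketch below uses the case study's figures and assumes 256 threads per block, which the case study does not specify; it also ignores register allocation granularity.

```python
SM_REGISTERS = 8192       # from the case study
SM_MAX_THREADS = 768      # from the case study
THREADS_PER_BLOCK = 256   # assumption: not stated in the case study


def occupancy(regs_per_thread: int) -> float:
    """Fraction of the SM's thread slots filled, with blocks placed whole."""
    blocks_by_regs = SM_REGISTERS // (regs_per_thread * THREADS_PER_BLOCK)
    blocks_by_threads = SM_MAX_THREADS // THREADS_PER_BLOCK
    resident = min(blocks_by_regs, blocks_by_threads) * THREADS_PER_BLOCK
    return resident / SM_MAX_THREADS


for regs in (8, 10, 11, 16):
    print(f"{regs:2d} registers/thread -> occupancy {occupancy(regs):.0%}")
```

Under this assumption, going from 10 to 11 registers per thread removes an entire block from the SM, so occupancy falls from 100% to about 67% rather than by a few threads.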
3. Explain the risk of a Read-After-Write (RAW) hazard if `__syncthreads()` is omitted after loading a tile.
Solution:
Without the barrier, a thread might attempt to perform a calculation using a value in shared memory before the thread responsible for loading that specific value has actually finished writing it from global memory.
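The hazard can be mimicked on the CPU. The sketch below is an analogy only: Python threads stand in for GPU threads, a list stands in for shared memory, and `threading.Barrier` plays the role of `__syncthreads()`. The tile size and data values are arbitrary.

```python
import threading

TILE = 8
global_mem = list(range(100, 100 + TILE))  # stand-in for global memory
shared = [None] * TILE                     # stand-in for a shared-memory tile
result = [None] * TILE
barrier = threading.Barrier(TILE)          # stand-in for __syncthreads()


def worker(tid: int) -> None:
    # Phase 1: each thread loads one element of the tile.
    shared[tid] = global_mem[tid]
    # Wait until every thread has finished its write before anyone reads.
    barrier.wait()
    # Phase 2: each thread reads an element that a different thread loaded.
    neighbor = (tid + 1) % TILE
    result[tid] = shared[neighbor] * 2


threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(TILE)]
for th in threads:
    th.start()
for th in threads:
    th.join()

print(result)
```

If the `barrier.wait()` call were removed, a thread could reach phase 2 while its neighbor's slot still holds `None`, which is the CPU-side counterpart of reading a shared-memory value before it has been written.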